Scientific computing with non-standard floating point types
Author
Vlăduţ Mădălin Druţa (Supervisor: Dr. David Gregg)
Abstract
Master of Science in Computer Science thesis, Faculty of Engineering, Mathematics and Science, Department of Computer Science and Statistics, May 2015.
This study examines the possible use of non-standard floating-point types for scientific computing. The question of this thesis is: “Is there anything to be gained by supporting non-standard floating point data types?”. The thesis aims to address several gaps in the literature on the potential of such types, and investigates in particular a non-standard 48-bit floating-point type. As long as the full precision of the standard 64-bit floating-point type is not needed, the 48-bit type requires less memory, reduces the amount of data movement, and might be faster than the 64-bit type. The initial findings showed that the non-standard 48-bit type, without the use of Streaming SIMD (Single Instruction Multiple Data) Extensions (SSE), is slower than the standard 64-bit floating-point type. With SSE intrinsics, however, the non-standard 48-bit type is competitive with the standard 64-bit type. These results are good for a floating-point type that is not supported in hardware.
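The abstract does not describe how the 48-bit type is laid out, so the following is only a minimal sketch under a stated assumption: a 48-bit value is taken to be the top six bytes of an IEEE 754 double (sign, 11-bit exponent, 36 of the 52 fraction bits), with the low 16 fraction bits truncated. The names `f48`, `f48_from_double`, and `f48_to_double` are hypothetical and are not taken from the thesis. The sketch shows the scalar, non-SSE conversion path.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 48-bit storage type: assumed here to be the top 6 bytes
   of an IEEE 754 double. This layout is an assumption for illustration. */
typedef struct { uint8_t bytes[6]; } f48;

/* Pack a double into 6 bytes by dropping its low 16 fraction bits.
   This truncates (rounds toward zero) rather than rounding to nearest. */
static f48 f48_from_double(double x) {
    uint64_t bits;
    f48 out;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret without aliasing issues */
    bits >>= 16;                      /* keep sign, exponent, 36 fraction bits */
    for (int i = 0; i < 6; i++)
        out.bytes[i] = (uint8_t)(bits >> (8 * i));
    return out;
}

/* Unpack 6 bytes back into a double; the dropped bits come back as zeros. */
static double f48_to_double(f48 v) {
    uint64_t bits = 0;
    double x;
    for (int i = 0; i < 6; i++)
        bits |= (uint64_t)v.bytes[i] << (8 * i);
    bits <<= 16;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Under this assumption, arithmetic would still be carried out in 64-bit registers, and the saving would come only from storing and moving six bytes per value instead of eight. Values whose fraction fits in 36 bits, such as 1.5, survive the round trip exactly; others lose at most a relative 2^-36.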
Similar resources
Optimal design of fixed-point and floating-point arithmetic units for scientific applications
The challenge in designing a floating-point arithmetic co-processor/processor for scientific and engineering applications is to improve the performance, efficiency, and computational accuracy of the arithmetic unit. The arithmetic unit should efficiently support several mathematical functions corresponding to scientific and engineering computation demands. Moreover, the computations should be p...
Accurate Floating-Point Summation Part II: Sign, K-Fold Faithful and Rounding to Nearest
In this Part II of this paper we first refine the analysis of error-free vector transformations presented in Part I. Based on that we present an algorithm for calculating the rounded-to-nearest result of s := ∑ p_i for a given vector of floating-point numbers p_i, as well as algorithms for directed rounding. A special algorithm for computing the sign of s is given, also working for huge dimensions...
Floating Point Unit Generation and Evaluation for FPGAs
Floating point units form an important component of many reconfigurable computing applications. The creation of floating point units under a collection of area, latency, and throughput constraints is an important consideration for system designers. Given the range of possible tradeoffs, most commercial or academic floating point libraries for FPGAs provide a small fraction of possible floating ...
A Library of Parameterizable Floating-Point Cores for FPGAs and Their Application to Scientific Computing
Advances in field programmable gate arrays (FPGAs), which are the platform of choice for reconfigurable computing, have made it possible to use FPGAs in increasingly many areas of computing, including complex scientific applications. These applications demand high performance and high-precision, floating-point arithmetic. Until now, most of the research has not focussed on compliance with IEEE ...
A Distillation Algorithm for Floating-Point Summation
The addition of two or more floating-point numbers is fundamental to numerical computations. This paper describes an efficient “distillation” style algorithm which produces a precise sum by exploiting the natural accuracy of compensated cancellation. The algorithm is applicable to all sets of data but is particularly appropriate for ill-conditioned data, where standard methods fail due to the a...